Witnessing the impressive achievements of pre-training techniques on large-scale data in the fields of computer vision and natural language processing, we wonder whether this idea could be adopted in a grab-and-go spirit to mitigate the sample inefficiency problem in visuomotor driving. Given the highly dynamic and variable nature of the input, the visuomotor driving task inherently lacks view and translation invariance, and the visual input contains massive amounts of information irrelevant to decision making, rendering predominant pre-training approaches from general vision less suitable for autonomous driving. To this end, we propose PPGeo (Policy Pre-training via Geometric modeling), an intuitive and straightforward fully self-supervised framework curated for policy pretraining in visuomotor driving. We aim to learn policy representations as a powerful abstraction by modeling 3D geometric scenes on large-scale unlabeled and uncalibrated YouTube driving videos. The proposed PPGeo is performed in two stages to support effective self-supervised training. In the first stage, the geometric modeling framework generates pose and depth predictions simultaneously, taking two consecutive frames as input. In the second stage, the visual encoder learns the driving policy representation by predicting future ego-motion and optimizing the photometric error based on the current visual observation only. As such, the pre-trained visual encoder is equipped with rich driving-policy-related representations and is thereby competent for multiple visuomotor driving tasks. Extensive experiments covering a wide span of challenging scenarios demonstrate the superiority of our approach, with improvements ranging from 2% to over 100% with very limited data. Code and models will be available at https://github.com/OpenDriveLab/PPGeo.
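For concreteness, the snippet below is a minimal PyTorch sketch of the photometric objective such a second stage could optimize, assuming the depth map and relative pose are supplied by the (frozen) first-stage networks and that camera intrinsics K are available; the function names, shapes, and interfaces are illustrative assumptions, not PPGeo's actual API.

```python
import torch
import torch.nn.functional as F

def inverse_warp(src, depth, pose_T, K):
    """Warp the source frame into the target view given the target depth map,
    a (B, 4, 4) relative pose, and (3, 3) camera intrinsics K. This is the
    standard monocular-SfM inverse warp; names and shapes are illustrative."""
    B, _, H, W = src.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], 0).float().view(1, 3, -1)
    cam = (torch.linalg.inv(K) @ pix) * depth.view(B, 1, -1)   # back-project pixels
    cam_h = torch.cat([cam, cam.new_ones(B, 1, H * W)], 1)     # homogeneous coords
    proj = K @ (pose_T @ cam_h)[:, :3]                         # re-project into source view
    uv = (proj[:, :2] / proj[:, 2:].clamp(min=1e-6)).view(B, 2, H, W)
    gx = 2 * uv[:, 0] / (W - 1) - 1                            # normalize to [-1, 1]
    gy = 2 * uv[:, 1] / (H - 1) - 1
    return F.grid_sample(src, torch.stack([gx, gy], dim=-1), align_corners=True)

def photometric_loss(src, tgt, depth, pose_T, K):
    """L1 photometric error between the warped source and the target frame."""
    return (inverse_warp(src, depth, pose_T, K) - tgt).abs().mean()
```

In practice, monocular self-supervision of this kind usually combines the L1 term with an SSIM term and masks out static or occluded pixels.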
Many existing autonomous driving paradigms involve a multi-stage discrete pipeline of tasks. To better predict the control signals and enhance user safety, an end-to-end approach that benefits from joint spatial-temporal feature learning is desirable. While there are some pioneering works on LiDAR-based input or implicit design, in this paper we formulate the problem in an interpretable vision-based setting. In particular, we propose a spatial-temporal feature learning scheme, named ST-P3, towards a set of more representative features for perception, prediction, and planning tasks simultaneously. Specifically, an egocentric-aligned accumulation technique is proposed to preserve geometric information in 3D space before the bird's-eye-view transformation for perception; a dual-pathway model is devised to take past motion variations into account for future prediction; and a temporal-based refinement unit is introduced to compensate for recognizing vision-based elements for planning. To the best of our knowledge, we are the first to systematically investigate each part of a vision-based end-to-end autonomous driving system. We benchmark our method against previous state-of-the-art approaches on the open-loop nuScenes dataset as well as in closed-loop CARLA simulation. The results show the effectiveness of our method. Source code, models, and protocol details are publicly available at https://github.com/openperceptionx/st-p3.
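As a rough illustration of the egocentric-aligned accumulation idea, the sketch below warps past bird's-eye-view feature maps into the current ego frame before fusing them; the affine-transform interface and simple averaging are assumptions made for illustration, not ST-P3's actual implementation.

```python
import torch
import torch.nn.functional as F

def egocentric_accumulate(bev_feats, ego2curr):
    """Align each past BEV feature map to the current ego frame with a 2D
    affine warp, then fuse by averaging (illustrative sketch).

    bev_feats: list of (B, C, H, W) BEV features, oldest first
    ego2curr:  list of (B, 2, 3) affine transforms into the current ego frame
    """
    acc = torch.zeros_like(bev_feats[-1])
    for feat, theta in zip(bev_feats, ego2curr):
        grid = F.affine_grid(theta, list(feat.shape), align_corners=False)
        acc = acc + F.grid_sample(feat, grid, align_corners=False)
    return acc / len(bev_feats)
```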
Equipped with a wide span of sensors, predominant autonomous driving solutions are becoming increasingly oriented towards safe system design. Although these sensors have laid a solid foundation, most production solutions to date still fall into the L2 phase. Among these, Comma.ai comes to our sight, claiming that a $999 aftermarket device mounted with a single camera and a board inside has the ability to handle L2 scenarios. Together with the open-sourced software of the entire system released by Comma.ai, the project is named Openpilot. Is it possible? If so, how is it made possible? With curiosity in mind, we deep-dive into Openpilot and conclude that the key to its success is the end-to-end system design rather than a conventional modular framework. The model, briefed as Supercombo, can predict the ego vehicle's future trajectory and other road semantics from monocular input. Unfortunately, the training process and the massive amount of data that make all this work are not publicly available. To conduct an in-depth investigation, we attempt to reimplement the training details and test the pipeline on public benchmarks. The refactored network proposed in this work is referred to as OP-Deepdive. For a fair comparison of our version with the original Supercombo, we introduce a dual-model deployment scheme to test the driving performance in the real world. Experimental results on nuScenes, Comma2k19, CARLA, and in-house realistic scenarios demonstrate that a low-cost device can indeed achieve most L2 functionalities and be on par with the original Supercombo model. In this report, we share our latest findings, shed light on a new perspective on end-to-end autonomous driving from the industrial product-level side, and hope to inspire the community to continue improving performance. Our code and benchmarks are available at https://github.com/openperceptionx/openpilot-deepdive.
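As a rough sketch of what an end-to-end monocular trajectory head looks like, the snippet below decodes future waypoints from image features with a GRU, a common pattern for this task; it is purely illustrative and not the actual Supercombo or OP-Deepdive architecture.

```python
import torch
import torch.nn as nn

class TrajectoryHead(nn.Module):
    """Decode future 2D waypoints from monocular image features with a GRU
    (illustrative layer choices and dimensions)."""
    def __init__(self, feat_dim=512, hidden=256, horizon=20):
        super().__init__()
        self.horizon = horizon
        self.init_h = nn.Linear(feat_dim, hidden)
        self.gru = nn.GRUCell(2, hidden)
        self.out = nn.Linear(hidden, 2)

    def forward(self, feat):                      # feat: (B, feat_dim)
        h = torch.tanh(self.init_h(feat))
        wp = feat.new_zeros(feat.size(0), 2)      # start at the ego position
        waypoints = []
        for _ in range(self.horizon):
            h = self.gru(wp, h)
            wp = wp + self.out(h)                 # predict per-step displacement
            waypoints.append(wp)
        return torch.stack(waypoints, dim=1)      # (B, horizon, 2)
```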
Current end-to-end autonomous driving methods either run a controller based on a planned trajectory or perform control prediction directly, which have spanned two separately studied lines of research. Seeing their potential mutual benefits, this paper takes the initiative to explore the combination of these two well-developed worlds. Specifically, our integrated approach has two branches for trajectory planning and direct control, respectively. The trajectory branch predicts the future trajectory, while the control branch involves a novel multi-step prediction scheme so that the relationship between the current action and future states can be reasoned about. The two branches are connected so that the control branch receives corresponding guidance from the trajectory branch at each time step. The outputs from the two branches are then fused to achieve complementary advantages. Our approach is evaluated in a closed-loop urban driving setting with challenging scenarios using the CARLA simulator. Even with only a monocular camera input, the proposed method ranks first on the official CARLA Leaderboard, outperforming other complex candidates with multiple sensors or fusion mechanisms. Source code and data will be publicly available at https://github.com/openperceptionx/tcp.
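A minimal sketch of such a two-branch design is given below: the control branch unrolls several steps and receives per-step guidance derived from the trajectory branch; the layer choices, dimensions, and fusion-by-guidance scheme are illustrative assumptions rather than the exact TCP architecture.

```python
import torch
import torch.nn as nn

class TwoBranchModel(nn.Module):
    """Combined trajectory-planning / control-prediction sketch: the control
    GRU is guided at each step by a trajectory-branch feature."""
    def __init__(self, feat_dim=256, horizon=4):
        super().__init__()
        self.horizon = horizon
        self.traj_head = nn.Linear(feat_dim, horizon * 2)        # future waypoints
        self.step_guid = nn.Linear(feat_dim, horizon * feat_dim)  # per-step guidance
        self.gru = nn.GRUCell(feat_dim, feat_dim)
        self.ctrl_head = nn.Linear(feat_dim, 3)                  # steer/throttle/brake

    def forward(self, feat):                                     # feat: (B, feat_dim)
        waypoints = self.traj_head(feat).view(-1, self.horizon, 2)
        guidance = self.step_guid(feat).view(-1, self.horizon, feat.size(-1))
        h, controls = feat, []
        for t in range(self.horizon):
            h = self.gru(guidance[:, t], h)  # guidance from the trajectory branch
            controls.append(torch.tanh(self.ctrl_head(h)))       # squashed controls
        return waypoints, torch.stack(controls, dim=1)
```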
While mislabeled or ambiguously-labeled samples in the training set can negatively affect the performance of deep models, diagnosing the dataset and identifying mislabeled samples helps to improve the generalization power. Training dynamics, i.e., the traces left by iterations of optimization algorithms, have recently been shown to be effective in localizing mislabeled samples with hand-crafted features. In this paper, going beyond manually designed features, we introduce a novel learning-based solution that leverages a noise detector, instantiated as an LSTM network, which learns to predict whether a sample was mislabeled using the raw training dynamics as input. Specifically, the proposed method trains the noise detector in a supervised manner on a dataset with synthesized label noise and can adapt to various datasets (with either natural or synthesized label noise) without retraining. We conduct extensive experiments to evaluate the proposed method. We train the noise detector on the label-noise-synthesized CIFAR dataset and test it on Tiny ImageNet, CUB-200, Caltech-256, WebVision, and Clothing1M. Results show that the proposed method precisely detects mislabeled samples on these datasets without further adaptation and outperforms state-of-the-art methods. Further experiments demonstrate that the mislabel identification can guide label correction, namely data debugging, providing improvements orthogonal to algorithm-centric state-of-the-art techniques from the data side.
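A minimal PyTorch sketch of such a detector is shown below: an LSTM consumes a sample's raw training dynamics (e.g., its per-epoch loss and predicted-probability traces) and emits a mislabel probability; the feature dimension and layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class NoiseDetector(nn.Module):
    """Binary classifier over a sample's training-dynamics sequence,
    instantiated as an LSTM (illustrative sizes)."""
    def __init__(self, dyn_dim=2, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(dyn_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, dynamics):                  # (B, T, dyn_dim), T = epochs
        _, (h, _) = self.lstm(dynamics)
        return torch.sigmoid(self.head(h[-1]))    # P(sample is mislabeled)
```

Such a detector would be trained with a binary cross-entropy loss on a dataset with synthesized label noise, where the noisy samples are known by construction.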
Audio commands are a preferred communication medium for keeping inspectors in the loop of civil infrastructure inspection performed by a semi-autonomous drone. To understand job-specific commands from a heterogeneous and dynamic group of inspectors, a model must be developed cost-effectively for the group and easily adapted when the group changes. This paper is motivated to build a multi-task deep learning model with a share-split-collaborate architecture. The architecture allows two classification tasks to share the feature extractor and then, through feature projection and collaborative training, split the subject-specific and keyword-specific features that are intertwined in the extracted features. A base model for a group of five authorized subjects was trained and tested on the inspection keyword dataset collected in this study. The model achieves a mean accuracy of 95.3% or higher in classifying the keywords of any authorized inspector, and a mean accuracy of 99.2% in speaker classification. Thanks to the richer keyword representations the model learns from the pooled training data, adapting the base model to a new inspector requires only a small amount of training data from that inspector, e.g., five utterances per keyword. Using the speaker classification scores for inspector verification achieves a success rate of at least 93.9% in verifying authorized inspectors and 76.1% in detecting unauthorized ones. Moreover, the paper demonstrates the applicability of the proposed model to larger groups on a public dataset. This paper provides a solution to challenges facing AI-assisted human-robot interaction, including worker heterogeneity, worker dynamics, and job heterogeneity.
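The sketch below illustrates the share-then-split pattern described above: two classification heads share one feature extractor, and task-specific projections separate keyword-specific from speaker-specific features; the layers and dimensions are illustrative assumptions, not the paper's model.

```python
import torch
import torch.nn as nn

class ShareSplitModel(nn.Module):
    """Two heads share a feature extractor; projections split the shared
    features into keyword- and speaker-specific subspaces (illustrative)."""
    def __init__(self, in_dim=40, feat=128, n_keywords=10, n_speakers=5):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, feat), nn.ReLU())
        self.kw_proj = nn.Linear(feat, feat)    # keyword-specific subspace
        self.spk_proj = nn.Linear(feat, feat)   # speaker-specific subspace
        self.kw_head = nn.Linear(feat, n_keywords)
        self.spk_head = nn.Linear(feat, n_speakers)

    def forward(self, x):                       # x: (B, in_dim) audio features
        shared = self.backbone(x)
        return (self.kw_head(self.kw_proj(shared)),
                self.spk_head(self.spk_proj(shared)))
```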
A recent study has revealed a phenomenon called neural collapse, in which the within-class means of features and the classifier weight vectors converge to the vertices of a simplex equiangular tight frame at the terminal phase of training for classification. In this paper, we explore the corresponding structures of the last-layer feature centers and classifiers in semantic segmentation. Based on our empirical and theoretical analysis, we point out that semantic segmentation naturally brings contextual correlation and imbalanced distribution among classes, which breaks the equiangular and maximally separated structure of neural collapse for both feature centers and classifiers. However, such a symmetric structure is beneficial to discrimination for the minority classes. To preserve these advantages, we introduce a regularizer on feature centers to encourage the network to learn features closer to the appealing structure in imbalanced semantic segmentation. Experimental results show that our method brings significant improvements on both 2D and 3D semantic segmentation benchmarks. Moreover, our method ranks 1st and sets a new record (+6.8% mIoU) on the ScanNet200 test leaderboard. Code will be available at https://github.com/dvlab-research/Imbalanced-Learning.
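For reference, a simplex equiangular tight frame over K classes has normalized class centers with pairwise cosine similarity -1/(K-1); the sketch below is one possible center regularizer that pulls feature centers toward this structure, as an illustration rather than the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def etf_regularizer(centers):
    """Penalize deviation of normalized class centers from the simplex-ETF
    geometry: each pair of the K centers should have cosine similarity
    -1/(K-1). `centers` is a (K, D) tensor of per-class feature means."""
    K = centers.size(0)
    c = F.normalize(centers, dim=1)
    cos = c @ c.t()
    target = torch.full_like(cos, -1.0 / (K - 1))
    off_diag = ~torch.eye(K, dtype=torch.bool, device=cos.device)
    return ((cos - target)[off_diag] ** 2).mean()
```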
Weakly-supervised object localization aims to indicate the category as well as the scope of an object in an image given only image-level labels. Most existing works are based on Class Activation Mapping (CAM) and endeavor to enlarge the discriminative area inside the activation map to perceive the whole object, yet they ignore the co-occurrence confounder of object and context (e.g., fish and water), which makes it hard for the model to distinguish object boundaries. Besides, the use of CAM also brings a dilemma: classification and localization always suffer from a performance gap and cannot reach their highest accuracy simultaneously. In this paper, we propose a causal knowledge distillation method, dubbed KD-CI-CAM, to address these two under-explored issues in one go. More specifically, we tackle the co-occurrence context confounder problem via causal intervention (CI), which explores the causalities among image features, contexts, and categories to eliminate the biased object-context entanglement in the class activation maps. Based on the de-biased object feature, we additionally propose a multi-teacher causal distillation framework to balance the absorption of classification knowledge and localization knowledge during model training. Extensive experiments on several benchmarks demonstrate the effectiveness of KD-CI-CAM in learning clear object boundaries from confounding contexts and in addressing the dilemma between classification and localization performance.
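The distillation side of such a framework can be sketched as a weighted KL objective against two teachers, one carrying classification knowledge and one carrying localization knowledge; the formulation below is a plain multi-teacher distillation sketch, not necessarily the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def multi_teacher_distill_loss(student_logits, cls_teacher_logits,
                               loc_teacher_logits, T=4.0, alpha=0.5):
    """Temperature-scaled KL distillation from two teachers, weighted by
    alpha to balance classification and localization knowledge."""
    s = F.log_softmax(student_logits / T, dim=1)
    kl_cls = F.kl_div(s, F.softmax(cls_teacher_logits / T, dim=1),
                      reduction="batchmean")
    kl_loc = F.kl_div(s, F.softmax(loc_teacher_logits / T, dim=1),
                      reduction="batchmean")
    return (T * T) * (alpha * kl_cls + (1 - alpha) * kl_loc)
```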
In this work, we focus on instance-level open vocabulary segmentation, intending to expand a segmenter for instance-wise novel categories without mask annotations. We investigate a simple yet effective framework with the help of image captions, focusing on exploiting thousands of object nouns in captions to discover instances of novel classes. Rather than adopting pretrained caption models or using massive caption datasets with complex pipelines, we propose an end-to-end solution from two aspects: caption grounding and caption generation. In particular, we devise a joint Caption Grounding and Generation (CGG) framework based on a Mask Transformer baseline. The framework has a novel grounding loss that performs explicit and implicit multi-modal feature alignments. We further design a lightweight caption generation head to allow for additional caption supervision. We find that grounding and generation complement each other, significantly enhancing the segmentation performance for novel categories. We conduct extensive experiments on the COCO dataset with two settings: Open Vocabulary Instance Segmentation (OVIS) and Open Set Panoptic Segmentation (OSPS). The results demonstrate the superiority of our CGG framework over previous OVIS methods, achieving a large improvement of 6.8% mAP on novel classes without extra caption data. Our method also achieves over 15% PQ improvements for novel classes on the OSPS benchmark under various settings.
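One common way to realize an explicit grounding loss is a symmetric InfoNCE-style alignment between mask-query embeddings and the embeddings of object nouns parsed from the paired caption, as sketched below; the temperature and matching scheme are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def grounding_loss(query_emb, noun_emb, temperature=0.07):
    """Symmetric contrastive alignment between N matched (query, noun) pairs.

    query_emb: (N, D) embeddings of queries matched to caption nouns
    noun_emb:  (N, D) embeddings of the corresponding object nouns
    """
    q = F.normalize(query_emb, dim=1)
    n = F.normalize(noun_emb, dim=1)
    logits = q @ n.t() / temperature                 # pairwise similarities
    labels = torch.arange(q.size(0), device=q.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```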
Nearest-Neighbor (NN) classification has been proven a simple and effective approach for few-shot learning. The query data can be classified efficiently by finding the nearest support class based on features extracted by pretrained deep models. However, NN-based methods are sensitive to the data distribution and may produce false predictions if the samples in the support set happen to lie around the distribution boundary of different classes. To solve this issue, we present P3DC-Shot, an improved nearest-neighbor based few-shot classification method empowered by prior-driven data calibration. Inspired by the distribution calibration technique, which utilizes the distribution or statistics of the base classes to calibrate the data for few-shot tasks, we propose a novel discrete data calibration operation that is more suitable for NN-based few-shot classification. Specifically, we treat the prototypes representing each base class as priors and calibrate each support data point based on its similarity to the different base prototypes. Then, we perform NN classification using the discretely calibrated support data. Results from extensive experiments on various datasets show that our efficient non-learning-based method outperforms, or is at least comparable to, SOTA methods that require additional learning steps.
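The sketch below illustrates the prior-driven calibration pattern described above: each support feature is shifted toward a similarity-weighted combination of base-class prototypes before nearest-neighbor classification; the weighting and mixing coefficient are illustrative assumptions, and the exact discrete calibration in P3DC-Shot may differ.

```python
import torch
import torch.nn.functional as F

def calibrate_support(support, base_protos, tau=0.1, lam=0.5):
    """Shift each support feature toward a similarity-weighted mix of
    base-class prototypes used as priors (illustrative sketch).

    support:     (S, D) support features
    base_protos: (K, D) base-class prototypes
    """
    sims = F.normalize(support, dim=1) @ F.normalize(base_protos, dim=1).t()
    w = torch.softmax(sims / tau, dim=1)     # similarity-based prototype weights
    return lam * support + (1 - lam) * w @ base_protos
```

Queries would then be classified by finding the nearest class among the calibrated support features, keeping the method free of additional learning steps.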